Landcover classification¶

Demonstrate how one could use Julia to participate in the OGH2023 hackathon. By Maarten Pronk (@evetion). This is a reproduceable notebook to produce a landcover classification. Challenge is described at kaggle.

Input (landsat 8 scene) is provided. We add:

  • Roughness of B1
  • Ridge detection of B8

Output:

  • Landcover classification

Input data¶

In [1]:
using Pkg
using Rasters, ArchGDAL, Plots, DataFrames
using RemoteSensingToolbox, Rasters
using ImageShow
using Geomorphometry
[ Info: Precompiling RemoteSensingToolbox [c88070b3-ddf6-46c7-b699-196056389566]

Let's retrieve all landsat scene files and load the high resolution file.

In [13]:
folder = "/Users/evetion/Downloads/ogh2023/Landast"  # typo comes from the zipfile
tifs = sort(filter(endswith("TIF"), readdir(folder, join=true)))
b8, stack... = tifs  # split into 1, many
highres = Raster(b8)
plot(highres; max_res=2000, c=:turbo)
Out[13]:

The other lowres bands can be read instantly by using the RemoteSensingToolbox package.

In [14]:
landsat = read_bands(Landsat8, "/Users/evetion/Downloads/ogh2023/Landast")
Out[14]:
RasterStack with dimensions: 
  X Projected{Float64} LinRange{Float64}(246375.0, 307635.0, 2043) ForwardOrdered Regular Intervals crs: WellKnownText,
  Y Projected{Float64} LinRange{Float64}(5.948e6, 5.831e6, 3901) ReverseOrdered Regular Intervals crs: WellKnownText
and 7 layers:
  :B1 UInt16 dims: X, Y (2043×3901)
  :B2 UInt16 dims: X, Y (2043×3901)
  :B3 UInt16 dims: X, Y (2043×3901)
  :B4 UInt16 dims: X, Y (2043×3901)
  :B5 UInt16 dims: X, Y (2043×3901)
  :B6 UInt16 dims: X, Y (2043×3901)
  :B7 UInt16 dims: X, Y (2043×3901)

It has specific methods for Landsat8, such as visualizing combinations of specific bands.

In [15]:
visualize(landsat, Agriculture{Landsat8}; upper=0.90)
Out[15]:

Let's change the underlying values to reflectance

In [16]:
landsat_sr = dn_to_reflectance(Landsat8, landsat)
Out[16]:
RasterStack with dimensions: 
  X Projected{Float64} LinRange{Float64}(246375.0, 307635.0, 2043) ForwardOrdered Regular Intervals crs: WellKnownText,
  Y Projected{Float64} LinRange{Float64}(5.948e6, 5.831e6, 3901) ReverseOrdered Regular Intervals crs: WellKnownText
and 7 layers:
  :B1 Float32 dims: X, Y (2043×3901)
  :B2 Float32 dims: X, Y (2043×3901)
  :B3 Float32 dims: X, Y (2043×3901)
  :B4 Float32 dims: X, Y (2043×3901)
  :B5 Float32 dims: X, Y (2043×3901)
  :B6 Float32 dims: X, Y (2043×3901)
  :B7 Float32 dims: X, Y (2043×3901)

Derive features¶

And then derive several spectral features, based on a 5min Google search.

In [17]:
# From https://yceo.yale.edu/tasseled-cap-transform-landsat-8-oli
function brightness(ls, ::Type{Landsat8})  # we copy the type signature to match RemoteSensingToolbox calls
    b1 = ls[:B1]
    b2 = ls[:B2]
    b3 = ls[:B3]
    b4 = ls[:B4]
    b5 = ls[:B5]
    b6 = ls[:B6]
    b1*0.3029+b2*0.2786+b3*0.4733+b4*0.5599+b5*0.508+b6*0.1872
end
function greenness(ls, ::Type{Landsat8})
    b1 = ls[:B1]
    b2 = ls[:B2]
    b3 = ls[:B3]
    b4 = ls[:B4]
    b5 = ls[:B5]
    b6 = ls[:B6]
    b1*(-0.2941)+b2*(-0.243)+b3*(-0.5424)+b4*0.7276+b5*0.0713+b6*(-0.1608)
end
function wetness(ls, ::Type{Landsat8})
    b1 = ls[:B1]
    b2 = ls[:B2]
    b3 = ls[:B3]
    b4 = ls[:B4]
    b5 = ls[:B5]
    b6 = ls[:B6]
    b1*0.1511+b2*0.1973+b3*0.3283+b4*0.3407+b5*(-0.7117)+b6*(-0.4559)
end
# Normally used for DEMs, but works in a 3x3 window, so takes into account neighbors. Note that this is the only pixel value to do so.
function roughness(ls, ::Type{Landsat8})
    Geomorphometry.roughness(ls[:B1])
end
# From https://www.jetir.org/papers/JETIRAK06043.pdf, trying to find road extraction feature
function rei(ls, ::Type{Landsat8})
    (ls[:B7] .- ls[:B2]) ./ (ls[:B7] .+ ls[:B2])
end
Out[17]:
rei (generic function with 1 method)

We add all these spectral features together in a new RasterStack.

In [18]:
# List of functions
functions = [mndwi, ndbi, ndmi, ndvi, savi, brightness, greenness, wetness, roughness, rei]
# Map each function to the rasterstack, and combine these rasters into a new RasterStack
smrs = RasterStack(
    NamedTuple(
        zip(
            Symbol.(functions), # name of function 
            map(x->x(landsat_sr, Landsat8), functions)  # apply function
        )
    )
)
Out[18]:
RasterStack with dimensions: 
  X Projected{Float64} LinRange{Float64}(246375.0, 307635.0, 2043) ForwardOrdered Regular Intervals crs: WellKnownText,
  Y Projected{Float64} LinRange{Float64}(5.948e6, 5.831e6, 3901) ReverseOrdered Regular Intervals crs: WellKnownText
and 10 layers:
  :mndwi      Float32 dims: X, Y (2043×3901)
  :ndbi       Float32 dims: X, Y (2043×3901)
  :ndmi       Float32 dims: X, Y (2043×3901)
  :ndvi       Float32 dims: X, Y (2043×3901)
  :savi       Float32 dims: X, Y (2043×3901)
  :brightness Float64 dims: X, Y (2043×3901)
  :greenness  Float64 dims: X, Y (2043×3901)
  :wetness    Float64 dims: X, Y (2043×3901)
  :roughness  Float32 dims: X, Y (2043×3901)
  :rei        Float32 dims: X, Y (2043×3901)
In [22]:
plot(smrs[:rei], clim=(-1,1))  # Not sure if this has an effect
Out[22]:

Let's detect ridges, hopefully roadlike things, in the high-resolution image.

In [35]:
using ImageFiltering
σ = 3
out2 = imfilter(highres, Kernel.LoG(σ))  # Laplacian of Gaussians
plot(out2[2500:end, 2500:end-2500], clim=(-50, 0), c=:turbo)
Out[35]:

This seems to do something. But since it ran on the high resolution B8 band, we resample all to the lower-resolution other bands, so sampling is straightforward. Note that sampling (or resampling with a specific reducer) on this higher resolution image might be better, as roads can be small.

In [38]:
lowres = resample(highres; to=smrs)  # this actually calls gdalwarp
ridge = resample(out2*1000; to=smrs)
ridge[isinf.(ridge)] .= 0.0  # Infinite is hard to train on

# Also correct high-res band to reflectance
lowresn = dn_to_reflectance(Landsat8, lowres)  # not sure if this correct (and notice we didn't do anything about the clouds that are in there).
lowresn[isinf.(lowresn)] .= 0.0;

Finally we stack everything together, so we have a stack on which we can train.

In [102]:
crst = RasterStack(landsat_sr..., smrs..., RasterStack((;lowres=lowresn, ridge=ridge))...)
Out[102]:
RasterStack with dimensions: 
  X Projected{Float64} LinRange{Float64}(246375.0, 307635.0, 2043) ForwardOrdered Regular Intervals crs: WellKnownText,
  Y Projected{Float64} LinRange{Float64}(5.948e6, 5.831e6, 3901) ReverseOrdered Regular Intervals crs: WellKnownText
and 19 layers:
  :B1         Float32 dims: X, Y (2043×3901)
  :B2         Float32 dims: X, Y (2043×3901)
  :B3         Float32 dims: X, Y (2043×3901)
  :B4         Float32 dims: X, Y (2043×3901)
  :B5         Float32 dims: X, Y (2043×3901)
  :B6         Float32 dims: X, Y (2043×3901)
  :B7         Float32 dims: X, Y (2043×3901)
  :mndwi      Float32 dims: X, Y (2043×3901)
  :ndbi       Float32 dims: X, Y (2043×3901)
  :ndmi       Float32 dims: X, Y (2043×3901)
  :ndvi       Float32 dims: X, Y (2043×3901)
  :savi       Float32 dims: X, Y (2043×3901)
  :brightness Float64 dims: X, Y (2043×3901)
  :greenness  Float64 dims: X, Y (2043×3901)
  :wetness    Float64 dims: X, Y (2043×3901)
  :roughness  Float32 dims: X, Y (2043×3901)
  :rei        Float32 dims: X, Y (2043×3901)
  :lowres     Float32 dims: X, Y (2043×3901)
  :ridge      Float64 dims: X, Y (2043×3901)

Note that these are many features already, and some don't add much information at all. One could use PCA to reduce the number of bands. Be careful though, as we don't know yet how these features relate to the landcover classes.

Training and test datasets¶

With our features complete, let's load the test and training datasets. We add them together, and combine them with our feature data to create one large DataFrame with all data.

In [106]:
Xr = Raster(joinpath(folder, "..", "training_set.tif"))
Yr = Raster(joinpath(folder, "..", "validation_set.tif"))

# Resample our stack to these datasets
trainrs = RasterStack(resample(crst; to=Xr)..., RasterStack((;class=Xr))...)
valrs = RasterStack(resample(crst; to=Yr)..., RasterStack((;class=Yr))...)

# For debugging, let's write out our input data.
Rasters.write("trainbands.tif", Raster(trainrs), force=true)
Rasters.write("valbinds.tif", Raster(valrs), force=true)

# Calling DataFrame on a RasterStack will transform it into a Table of rows for each cell
traind = DataFrame(trainrs)
vald = DataFrame(valrs)
combined = vcat(traind, vald)

# Drop 0 as class?
subset!(combined, :class => c -> c.!= 0)

# drop row if it contains Inf
combined = combined[any.(eachrow(isinf.(combined))), :]

# drop X Y, as I believe these do not matter so much
select!(combined, Not([:X, :Y]))

combined
Out[106]:
54×20 DataFrame
29 rows omitted
RowB1B2B3B4B5B6B7mndwindbindmindvisavibrightnessgreennesswetnessroughnessreilowresridgeclass
Float32Float32Float32Float32Float32Float32Float32Float32Float32Float32Float32Float32Float64Float64Float64Float32Float32Float32Float64UInt8
1-0.05062-0.01929750.0486550.041010.3239850.1396530.08017-0.483239-0.3975790.3975790.7752850.5415240.2160080.023669-0.275758Inf1.634030.04612561952.34
2-0.0973425-0.050620.040680.03856250.302810.1445750.0905375-0.560821-0.3536890.3536890.7740740.5234790.1781490.045265-0.279624Inf3.536230.052862539320.94
3-0.04787-0.01407250.05467750.052120.3294580.1636320.0990075-0.499084-0.3362980.3362980.7268180.5183680.2346370.0229418-0.283377Inf1.331370.0525636766.34
4-0.156303-0.08436250.02613250.03036750.2991250.1610750.09722-0.720818-0.2999780.2999780.8156710.5420040.1406330.0698165-0.307658Inf14.12270.044422670.984
5Inf-0.187680.001134990.03157750.3048170.2188530.13924-0.989681-0.1641590.1641590.8122590.545336Inf-InfInfNaN-6.748970.04915-62675.14
6-0.104933-0.04272750.06463250.07970250.305340.2352430.17543-0.568937-0.129670.129670.5860070.4196920.2306780.0481221-0.300469Inf1.643960.0711225-53179.44
7-0.15611-0.1162630.0763750.06507250.298740.276630.241732-0.567287-0.03842750.03842750.6422750.4479280.1964510.0569028-0.338012Inf2.853230.12021-69675.92
80.0588850.072360.183020.1817820.3246180.349010.349972-0.3119940.0362106-0.03621060.282060.2271290.45664-0.0348822-0.244951Inf0.6573320.316615-1.71894e52
9-0.131113-0.04270.0956250.132530.2877130.293020.238047-0.5079060.00913931-0.009139310.3692690.2751010.2688640.0668944-0.290042Inf1.437170.16663-1.59984e52
100.0327050.0426050.08085750.068620.3435930.2108770.131815-0.445678-0.2393550.2393550.6670650.4927340.312488-0.0233117-0.277402Inf0.5114670.0565296406.04
110.04070750.052890.09147250.08454250.299070.2370850.160388-0.443187-0.115610.115610.559230.3998270.314005-0.0297255-0.245515Inf0.5040270.072002571462.94
120.043320.055640.093260.094250.2763270.2449770.163908-0.448553-0.06013750.06013750.4913340.3456620.311768-0.0279591-0.228096Inf0.4931390.0666463209.55
13-0.0288125-0.0020550.06908750.05181750.382010.1913530.118478-0.469456-0.3325250.3325250.7611150.5749410.2822940.00567031-0.323538Inf1.03530.054211.11824e54
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
43-0.0506475-0.03849250.01780.00924750.2646680.1240870.0612775-0.749097-0.3616160.3616160.9324790.5625110.1452180.0202404-0.251188Inf4.378760.0158475279612.03
44-0.0994325-0.05301250.0486550.0375450.325140.2114270.11779-0.625849-0.2119260.2119260.7929610.5522010.2039130.0322373-0.324511Inf2.636760.02723252.55798e53
450.04868250.08160.2266350.218550.4528780.3642180.266262-0.232854-0.1085060.1085060.3489990.3112110.565355-0.0243322-0.31604Inf0.5308490.2187431.01688e53
460.0522850.06413750.1798580.1823050.4532350.3673250.267968-0.342605-0.1046970.1046970.4262990.3731970.519912-0.0226222-0.348318Inf0.6137520.198118-2.63432e53
470.00605750.03416250.1015650.13880.4318680.319860.225342-0.517993-0.1490.1490.5135520.4327680.4164040.0151777-0.364896Inf0.736710.182827-1.31874e53
48-0.166423-0.156660.050910.0408450.3360850.235490.156125-0.644483-0.1759960.1759960.7832760.5554570.1677250.0752145-0.371977Inf-584.6330.0770921511.03
49-0.172005-0.09728750.049150.04659250.3529430.2363150.14573-0.65565-0.1979230.1979230.7667660.55850.1936780.0686346-0.3721Inf5.016620.0746425-88323.63
50Inf-0.177945-0.00816-0.0186650.2930470.1749630.107505-1.09784-0.2523130.2523131.136050.685952Inf-InfInfNaN-4.052390.0366161693.23
51-0.181245-0.12256-0.00563-0.018720.2623570.1412750.07973-1.08301-0.2999820.2999821.153670.6516890.05753390.0685083-0.310921Inf-4.723090.02514251.59385e53
52-0.055735-0.02765750.033420.023960.2959070.1538150.0916925-0.643015-0.3159560.3159560.8501880.556560.1837610.0187835-0.275465Inf1.863820.040762536287.63
53-0.08318-0.0601350.0022625-0.01566750.2496520.11790.0594625-0.962343-0.3584590.3584591.133920.6256830.09924410.0252911-0.260457Inf-177.8390.01590251.41743e53
54-0.0598875-0.0394550.015380.00204250.2852370.1270850.061965-0.784087-0.3835650.3835650.985780.6101760.1479820.0202467-0.27203Inf4.505550.01414251.6748e53

Note that there is a strong imbalance in the classes. Especially class 5 (undefined, roads and such) is underrepresented.

In [49]:
histogram(cX.class)
Out[49]:

To start training on this (now vector) dataset, we need to set the specific type in Julia. You might consider this odd, but a UInt8 value could be a category or a continuous variable (the default).

In [59]:
using ScientificTypes
cX.class = coerce(cX.class, Multiclass)
describe(cX)  # Useful to check for Inf, NaNs or other odd patterns that might disturb our model.
Out[59]:
20×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1B10.0258209-0.195490.0231350.5150550Float32
2B20.0359458-0.1521220.0324850.6038250Float32
3B30.0699633-0.01577750.06848250.6778280Float32
4B40.0621506-0.0298850.0571250.7436080Float32
5B50.3281050.011970.3279180.910560Float32
6B60.1765350.003967490.1704251.189490Float32
7B70.1042050.003692490.09251751.447390Float32
8mndwi-0.422946-2.09225-0.4508650.7026980Float32
9ndbi-0.300673-0.723134-0.3181250.3434960Float32
10ndmi0.300673-0.3434960.3181250.7231340Float32
11ndvi0.675958-15.5080.7033617.113030Float32
12savi0.483615-0.06020570.4853420.8042340Float32
13brightness0.285472-0.08351780.2903821.709160Float64
14greenness-0.014049-0.223667-0.01324130.09492120Float64
15wetness-0.258857-0.584506-0.2615340.02079850Float64
16roughness0.008811410.0001925080.00657250.4600470Float32
17rei0.475715-110.3770.489161045.050Float32
18layer180.0436117-0.009892510.0412850.7919520Float32
19layer19-243.917-9.55663e51406.853.41089e50Float64
20class150CategoricalValue{UInt8, UInt32}
In [52]:
schema(cX)
Out[52]:
┌────────────┬───────────────┬─────────────────────────────────┐
│ names      │ scitypes      │ types                           │
├────────────┼───────────────┼─────────────────────────────────┤
│ B1         │ Continuous    │ Float32                         │
│ B2         │ Continuous    │ Float32                         │
│ B3         │ Continuous    │ Float32                         │
│ B4         │ Continuous    │ Float32                         │
│ B5         │ Continuous    │ Float32                         │
│ B6         │ Continuous    │ Float32                         │
│ B7         │ Continuous    │ Float32                         │
│ mndwi      │ Continuous    │ Float32                         │
│ ndbi       │ Continuous    │ Float32                         │
│ ndmi       │ Continuous    │ Float32                         │
│ ndvi       │ Continuous    │ Float32                         │
│ savi       │ Continuous    │ Float32                         │
│ brightness │ Continuous    │ Float64                         │
│ greenness  │ Continuous    │ Float64                         │
│ wetness    │ Continuous    │ Float64                         │
│ roughness  │ Continuous    │ Float32                         │
│ rei        │ Continuous    │ Float32                         │
│ layer18    │ Continuous    │ Float32                         │
│ layer19    │ Continuous    │ Float64                         │
│ class      │ Multiclass{5} │ CategoricalValue{UInt8, UInt32} │
└────────────┴───────────────┴─────────────────────────────────┘

Modeling¶

Now for the actual ML part (much of all ML time spent is on data handling).

In [61]:
using MLJ
In [62]:
y, x = unpack(cX, ==(:class); rng=123)
Out[62]:
(CategoricalArrays.CategoricalValue{UInt8, UInt32}[0x04, 0x04, 0x03, 0x04, 0x03, 0x04, 0x03, 0x03, 0x04, 0x04  …  0x03, 0x04, 0x03, 0x04, 0x03, 0x03, 0x04, 0x01, 0x03, 0x04], 2121235×19 DataFrame
     Row │ B1          B2         B3         B4         B5         B6          ⋯
         │ Float32     Float32    Float32    Float32    Float32    Float32     ⋯
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ 0.0511025   0.058115   0.0826725  0.0841025  0.301188   0.253613    ⋯
       2 │ 0.0154075   0.0272325  0.06752    0.052395   0.367985   0.146885
       3 │ 0.0128225   0.018075   0.038975   0.027755   0.267445   0.12153
       4 │ 0.0139775   0.02748    0.07489    0.0628175  0.346233   0.146115
       5 │ 0.0226675   0.0392225  0.068235   0.0722225  0.217065   0.195862    ⋯
       6 │ 0.0188725   0.031935   0.0638075  0.061525   0.263017   0.132805
       7 │ 0.0234925   0.029625   0.0493975  0.04486    0.214095   0.13715
       8 │ 0.0266275   0.0271225  0.043705   0.0259125  0.460523   0.168747
       9 │ 0.0418075   0.0454925  0.0672725  0.0555025  0.342602   0.202683    ⋯
      10 │ 0.0405425   0.0568225  0.105277   0.100988   0.396338   0.22823
      11 │ 0.0285525   0.0409825  0.0800325  0.0764575  0.308613   0.185577
    ⋮    │     ⋮           ⋮          ⋮          ⋮          ⋮          ⋮       ⋱
 2121226 │ 0.03045     0.0330075  0.053825   0.03859    0.323435   0.15959
 2121227 │ 0.0140325   0.0303675  0.0699675  0.066255   0.362953   0.145922    ⋯
 2121228 │ 0.04596     0.0631475  0.10041    0.123757   0.342245   0.321813
 2121229 │ 0.0194775   0.036115   0.0755775  0.0929025  0.262055   0.195422
 2121230 │ 0.0114475   0.0198625  0.045025   0.0368025  0.207028   0.117378
 2121231 │ 0.0314675   0.04761    0.0876225  0.097825   0.310235   0.269205    ⋯
 2121232 │ 0.02143     0.0330075  0.0779425  0.059985   0.381707   0.165063
 2121233 │ 0.0129875   0.02363    0.054815   0.04299    0.0736525  0.03606
 2121234 │ 0.008175    0.01483    0.034685   0.0256925  0.21514    0.107808
 2121235 │ 0.016975    0.029295   0.0760175  0.05927    0.342685   0.160938    ⋯
                                             13 columns and 2121214 rows omitted)
In [63]:
Pkg.add("MLJXGBoostInterface")
Tree = @load XGBoostClassifier pkg=XGBoost
tree = Tree()  # no hypertuning yet
   Resolving package versions...
  No Changes to `~/code/Project.toml`
  No Changes to `~/code/Manifest.toml`
import MLJXGBoostInterface ✔
[ Info: For silent loading, specify `verbosity=0`. 
Out[63]:
XGBoostClassifier(
  test = 1, 
  num_round = 100, 
  booster = "gbtree", 
  disable_default_eval_metric = 0, 
  eta = 0.3, 
  num_parallel_tree = 1, 
  gamma = 0.0, 
  max_depth = 6, 
  min_child_weight = 1.0, 
  max_delta_step = 0.0, 
  subsample = 1.0, 
  colsample_bytree = 1.0, 
  colsample_bylevel = 1.0, 
  colsample_bynode = 1.0, 
  lambda = 1.0, 
  alpha = 0.0, 
  tree_method = "auto", 
  sketch_eps = 0.03, 
  scale_pos_weight = 1.0, 
  updater = nothing, 
  refresh_leaf = 1, 
  process_type = "default", 
  grow_policy = "depthwise", 
  max_leaves = 0, 
  max_bin = 256, 
  predictor = "cpu_predictor", 
  sample_type = "uniform", 
  normalize_type = "tree", 
  rate_drop = 0.0, 
  one_drop = 0, 
  skip_drop = 0.0, 
  feature_selector = "cyclic", 
  top_k = 0, 
  tweedie_variance_power = 1.5, 
  objective = "automatic", 
  base_score = 0.5, 
  watchlist = nothing, 
  nthread = 8, 
  importance_type = "gain", 
  seed = nothing, 
  validate_parameters = false, 
  eval_metric = String[])
In [64]:
mach = machine(tree, x, y)
Out[64]:
untrained Machine; caches model-specific representations of data
  model: XGBoostClassifier(test = 1, …)
  args: 
    1:	Source @371 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @656 ⏎ AbstractVector{Multiclass{5}}
In [65]:
pe = evaluate!(mach, resampling=StratifiedCV(; nfolds=6,
                               shuffle=true),
                 measures=[balanced_accuracy, kappa],
                 verbosity=1)
Evaluating over 6 folds: 100%[=========================] Time: 0:48:19
Out[65]:
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure             │ operation    │ measurement │ 1.96*SE │ per_fold        ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy(   │ predict_mode │ 0.604       │ 0.00147 │ [0.605, 0.601,  ⋯
│   adjusted = false) │              │             │         │                 ⋯
│ Kappa()             │ predict_mode │ 0.807       │ 0.00052 │ [0.807, 0.806,  ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted
In [66]:
pe.measurement
Out[66]:
2-element Vector{Float64}:
 0.604350054411069
 0.8067220895407315

These results look okaish, but can certainly be improved. Ideally one would:

  • Reduce training time (by reducing features)
  • Inspect feature importance
  • Inspect lowest scoring classes (undeveloped class 5) in relation to features

Prediction¶

With our trained model, let's first we create a prediction map to inspect our results. We create a DataFrame of our raster stack and input it to the model to predict values for our whole landsat scene.

In [78]:
# Make a prediction map
dfn = DataFrame(crst)
select!(dfn, Not([:X, :Y]))
dfn .= ifelse.(isinf.(dfn), 0.0, dfn)  # set any non-finite values to 0
pred = predict_mode(mach, dfn)
Out[78]:
7969743-element CategoricalArrays.CategoricalArray{UInt8,1,UInt32}:
 0x01
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 ⋮
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03
 0x03

We can reshape this vector of predictions into the raster form (the inverse of the DataFrame operation).

In [79]:
pr = Raster(int.(reshape(pred, size(srs)[1:2])), dims=dims(srs))
Rasters.write("prediction.tif", pr, force=true)
plot(pr)
┌ Warning: `missing` cant be written to .tif, missinval for `UInt32` of `4294967295` used instead
└ @ Rasters ~/.julia/packages/Rasters/47MWf/src/utils.jl:32
Out[79]:

Submission¶

Let's load the file with the points to be predicted and create our submission.csv.

In [68]:
using CSV
df = CSV.read(joinpath(folder, "..", "submission.csv"), DataFrame)
geometry = zip(df.Y, df.X)  # make an (implicit) point that Rasters can use for sampling
df
Out[68]:
11089×4 DataFrame
11064 rows omitted
RowIDXYcategory
Int64Float64Float64String3
114.09131e5532574.0NA
224.18181e5540287.0NA
334.33303e55.65382e5NA
444.01445e5558185.0NA
554.3286e55.49954e5NA
664.22147e55.5111e5NA
774.20718e55.64889e5NA
884.00312e55.50528e5NA
994.09671e55.35409e5NA
10104.14793e55.30543e5NA
11114.24709e55.75715e5NA
12124.09487e5551657.0NA
1313401917.05.58433e5NA
⋮⋮⋮⋮⋮
11078110784.26625e55.72455e5NA
11079110794.19039e55.51605e5NA
11080110804.01561e55.45422e5NA
11081110814.10687e55.53751e5NA
11082110824.12019e55.48915e5NA
11083110834.00542e55.43265e5NA
11084110844.28797e55.59852e5NA
11085110854.22043e55.6347e5NA
11086110864.20962e5565286.0NA
11087110874.00102e55.45037e5NA
11088110884.22388e5555861.0NA
11089110894.2023e55.57456e5NA

Sadly, not all data is in the same coordinate reference system, so here I project all submission points to the one of the raster. This is faster than the other way around.

In [83]:
using Proj
trans = Proj.Transformation("EPSG:2180", "EPSG:32634")
np = trans.(geometry)
Out[83]:
11089-element Vector{Tuple{Float64, Float64}}:
 (273835.179029267, 5.838726329239982e6)
 (283104.33198141155, 5.846191084579964e6)
 (298930.12895827554, 5.8708747066642335e6)
 (266859.24824864336, 5.86456275069286e6)
 (298057.37145709933, 5.855453116043926e6)
 (287372.51424354216, 5.856907964241545e6)
 (286327.3248015798, 5.870732207417169e6)
 (265512.49169201485, 5.856933581449814e6)
 (274454.3807359721, 5.841547744245808e6)
 (279443.4042761496, 5.83653763368146e6)
 (290621.67190626595, 5.88145103838913e6)
 (274722.91540568986, 5.857807774409168e6)
 (267338.0348033956, 5.864797326644369e6)
 ⋮
 (292447.9119871984, 5.878136110868777e6)
 (284277.1685591929, 5.857489956981629e6)
 (266619.4299403647, 5.851790333031429e6)
 (275981.736765133, 5.859869547691904e6)
 (277178.92529000354, 5.854994545333261e6)
 (265540.09238937247, 5.849661264884725e6)
 (294268.81199749594, 5.865467640499927e6)
 (287612.73477495473, 5.869276217833487e6)
 (286582.0663878548, 5.871122634843147e6)
 (265149.3522472769, 5.851445841857978e6)
 (287746.37921305536, 5.861654154152961e6)
 (285632.0698756116, 5.863309473601587e6)

Now we use these points to sample our Rasterstack crst and dump the geometry column (which our model can't use). We again predict the classification.

In [84]:
dfn = DataFrame(extract(crst, np))
select!(dfn, Not(:geometry))
pred = predict_mode(mach, dfn)
Out[84]:
11089-element CategoricalArrays.CategoricalArray{UInt8,1,UInt32}:
 0x03
 0x04
 0x04
 0x04
 0x04
 0x03
 0x03
 0x04
 0x03
 0x03
 0x04
 0x04
 0x04
 ⋮
 0x04
 0x04
 0x04
 0x04
 0x04
 0x04
 0x04
 0x04
 0x03
 0x04
 0x04
 0x04
In [85]:
# In our submission, the classes must be strings, so we convert here
class = ["water", "buildings", "vegetation", "cropland", "undeveloped"]
classes = class[int.(pred)]

# New dataframe for submission
cdf = DataFrame(ID=df.ID, category=classes)  # Kaggle doesn't accept any other columns

# Update input dataframe
df.category .= classes;
In [86]:
CSV.write("for_submission.csv", cdf)  # for submission
CSV.write("to_inspect.csv", df)  # for inspection (with x, y)
Out[86]:
"to_inspect.csv"

Further reading¶

The MLJ manual will help you explore further ML options in Julia.